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Abstract 

With the proUferation of modern high-resolution measuring instruments mounted on satel- 
lites, planes, ground-based vehicles and monitoring stations, a need has arisen for statistical 
methods suitable for the analysis of large spatial datasets observed on large spatial domains. 
Statistical analyses of such datasets provide two main challenges: First, traditional spatial- 
statistical techniques are often unable to handle large numbers of observations in a compu- 
tationally feasible way. Second, for large and heterogeneous spatial domains, it is often not 
appropriate to assume that a process of interest is stationary over the entire domain. 

We address the first challenge by using a model combining a low-rank component, which 
allows for flexible modeling of medium-to-long-range dependence via a set of spatial basis 
functions, with a tapered remainder component, which allows for modeling of local depen- 
dence using a compactly supported covariance function. Addressing the second challenge, we 
propose two extensions to this model that result in increased flexibility: First, the model is pa- 
rameterized based on a nonstationary Matem covariance, where the parameters vary smoothly 
across space. Second, in our fully Bayesian model, all components and parameters are con- 
sidered random, including the number, locations, and shapes of the basis functions used in the 
low-rank component. 

Using simulated data and a real-world dataset of high-resolution soil measurements, we 
show that both extensions can result in substantial improvements over the current state-of-the- 
art. 

Keywords: Covariance Tapering; Full-Scale Approximation; Low-Rank Models; Massive 
Datasets; Model Selection; Reversible-Jump MCMC 

1 Introduction 

From remote sensing of environmental variables using satellite instruments to proximal sensing of 
soil properties using a ground-based gamma-radiometer, a vast number of spatial measurements 
are now being obtained every day. Based on such very large, noisy, nongridded, and incomplete 
datasets, the goal is spatial prediction of a process of interest, together with rigorous quantifica- 
tion of prediction uncertainty. Computational feasibility for such datasets has been addressed from 
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several angles: Approximations by Gaussian Markov random fields (e.g., |Lindgren et al.[ |2011[), 



composite likelihoods (e.g., |Lindsay[ |1988[ |Curriero and Lele[ |1999[ [Bevilacqua et al.[ |2012[ [ET 
dsvik et aL||20121), covariance tapering, and low-rank models. We focus here on the latter two. 



Covariance tapering ( Furrer etaL||2006[ [Kaufman et al .l,l2008i |Shaby andRuppertl|2012| ) relies 
on compactly supported correlation functions (e.g., Gneiting, 2002 ) to produce sparse covariance 
matrices containing only a moderate number of nonzero elements. Use of efficient sparse-matrix 
algorithms then may result in computational feasibility for large datasets. However, by definition, 
covariance tapering is most appropriate for modeling processes with weak long-range dependence. 

A second approach to achieving computational feasibility for large spatial datasets is through 
low-rank models, which include a component that can be written as a linear combination of spatial 
basis functions, 

E]=M-)vj=H-yv, (1) 

where ?7|W ~ A^r(0, W), and the number of basis functions, r, is much smaller than the num- 
ber of observations, n. Many models that include such a component have been proposed (for a 
recent overview, see IWiklef 2010j). The models differ in the parameterizations and priors for the 
covariance matrix W and the functions in b(-). For discretized convolution models (i.e., convo- 
lution models whose integrals are discretized; see, e.g., IHigdonf [1998^ |Calder[ |2007[ |Lemos and 
Sanso, 2009), b(-) contains the convolution kernels, and W is often assumed to be a multiple of 
the identity. Other authors view b(-) as a vector of fixed basis functions, such as empirical or- 
thogonal functions (e.g. Mardia et al.[|1998 , Wikle and Cressie' '1999"), equatorial normal modes 



(e.g., |Wikle et al.[[2001| ), Fourier basis functions (e.g., |Xu et al.p,2005 ), W-wavelets (e.g., |Shi and 



Cressie[ |2007t [Cressie et al.[ |2010[ |Kang and Cressie[ |2011| ), or bisquare functions (e.g., [Cressie 



and Johannesson 2008 



Katzfuss and CressTe] , |2011[ |2012p . Here, we use the predictive -process 
approach ( [Banerjee et al. . 2008 ), where both b(-) and W are parameterized according to a "parent 
process," for which a parametric covariance model is chosen. 

Models with low-rank components ([T]) allow for fast computation via the Sherman-Morrison- 
Woodbury formula, as is made clear in |Cressie and Johannesson| ( [2006p and |Shi and Cressie] ( |2007[ ). 
For general W, they are also flexible, in that the covariance of ([1]), namely b(si)'W b(s2) for lo- 
cations Si and S2, is not of traditional parametric form. The fast computation and the flexibility 
make components of the form ([T]) very well suited to modeling medium-range to long-range spatial 
dependence. However, due to the dimension reduction inherent in ([T]), a low-rank component alone 



is typically not able to model "rough" (i.e., non-smooth) short-range dependence (see, e.g.. Stein 



2008[[Finley et al.[|2009| ). Some efforts have been made to address this problem (e.g., |Wikle and 



Cressiel \T999[ [Berliner et aL| |2000t [Wikle et a\.\ [200T| [Stemj [2008] ), including in the context of the 



predictive process (Katzfuss', '20 IjJ ch. 4; Sang et al. , 2011[ Sang and Huang[ 2012p . Here, we fol- 
low the approach of San g and Huang| (2012), who divide a parent process into a predictive -process 
component and a remainder component. The covariance matrix of the remainder component is 
then made sparse by multiplication of its covariance function with a compactly supported tapering 
function. This approach allows for computationally feasible inference, even for large datasets. 

The contributions of this article are two extensions of the approach by Sang and Huang| (2012), 
which allow for more flexibility and nonstationarity. First, we specify a nonstationary Matern 
model ( Paciorek and Schervish[ 2006, Stein[ [200 5 ) for the parent covariance, in which the param- 
eters vary smoothly across space as hnear combinations of spatial basis functions. 

The second extension is that we allow the set of basis-function locations (henceforth referred 
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to as "knots") in our low-rank component to be a random point process. This allows us to avoid 
choosing an arbitrary and fixed set of knots a priori. Here, b(-), rj, and W in ([T]) are all treated as 
unknown and random. This Bayesian source separation task (see, e.g., Knuth[ [2005 ), where both 
the "source signal" rj and the "mixing coefficients" b(-) have to be estimated, can be achieved by 
putting a prior on both components. This has been done in the context of discretized-convolution 
models by Lemos and Sanso ( [2009 1, who infer (spatially varying) parameters determining the 
shapes of their kernels. Lopes et"aL] ( |2008| ) also consider a model of the form ([T]) where both 
b(-) and rj are random, but as each basis function is itself a Gaussian process, their approach is 
infeasible for large spatial datasets. Recently, Guhaniyogi et al. (2011) also proposed a predictive- 
process model where the locations (but not the number) of the basis functions are assumed random. 
In this article, we implicitely make inference on the number, locations, and shapes of the basis 
functions. Our approach is a special case of that in |Katzfuss| ( |2011[ ch. 4) and is inspired by 
Holmes and Mallick ( 2001| ), who propose a piecewise linear spline regression model for which 



both the number and the locations of the splines are random. 

A third contribution of this article is partially philosophical in nature: We do not consider the 
parent process to be the truth that is to be approximated, but rather as a way of obtaining a prior 
for the two spatially dependent components in our model. The resulting process is more flexible 
than the parent process, and hence it is often preferable for modeling nonstationary real-world 
processes. 

Posterior inference for our model is described in detail. It is fairly involved but computationally 
feasible, even for very large datasets. A reversible-jump Markov chain Monte Carlo algorithm 



(Green 19951 allows us to infer the number of basis functions. We take advantage of sparse- 



matrix operations to ensure fast computation, and we employ marginalization strategies (e.g.. 



van 



Dyk and Park, 2008 1 to achieve satisfactory mixing of the Markov chain. 



This article is organized as follows: In Section |2} we introduce our nonstationary spatial model 
based on the model of [Sang and Huang ( 2012[ ). Section [3] deals with posterior inference on the un- 
known quantities in the model. In Section |4} we assess the effect of our extensions to the approach 
of Sang and Huang (2012j), using simulated data and a real-world dataset of soil measurements. 
Conclusions are given in Section [5| 



2 Methodology 

2.1 A Standard Spatial Statistical Model 

Let {l^(s) : s G V}, or Y{-), denote the process of interest on a spatial domain V C M"^, d E 
{1, 2, 3}. Suppose that at n locations we have observations on Y{-), namely ^(si), . . . , Z(sn), 
where n is very large, and we assume additive measurement error: 

Z(si) := F(s,) + e(s,), i = l,...,n, (2) 

where e(-)|(T^ ~ GWN(0, cr^) is Gaussian white noise and independent of Y(-). For simplicity 
and to ensure identifiability, throughout this article we will assume that is fixed and known. In 
practice, if is not known (e.g., from instrument experiments), it can be estimated from the data 



by extrapolating the variogram to the origin as described in jKang et al.| ( |2009| ). 
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In spatial statistics, the process model is often given by, 



F(-) :=/i(-)+^(-), 



(3) 



where yu(-) := x(-)'/3 is the large-scale trend, /3 has an (improper) flat prior on W, and w(-) is a 
spatially correlated component, which is typically modeled as a Gaussian process, 

a;(-)|6'~GP(0,Cp), (4) 

with mean zero and covariance function 

Cp(si, S2) = o-(si)cr(s2) Pp(si, S2), Si, 83 G V, (5) 
where a: V ^ Mq and the correlation function pp : (P x D) — > [— 1, 1] are parameterized by 0. 



2.2 A Low-Rank Component with Random Basis Functions 



While the standard spatial model described in Section 2.1 has been used extensively and success 



fully (see, e.g., Banerjee et al. 2004), it is computationally infeasible if n is very large (more than 
10,000 or so) and Cp is a standard covariance function (e.g., the exponential covariance function). 
This is because it takes on the order of computations to evaluate the likelihood. 

Many approximations or modeling approaches have been proposed to solve this problem (see 
Section[T]). We will focus here on the predictive process ( Banerjee et al.[ 2008[ ). Given a so-called 
"parent process" a;(-) as in (|4]), the predictive process is defined as, z/(-) := E{uj{-)\co(ki), . . . , a;(k,.)), 
where 

/C := {ki, . . . , k,}, with k, G P, J = 1, . . . , r, (6) 



is a set of knots. Conditional on 6 and /C, the predictive process can be written as a linear combi- 
nation of basis functions, namely as z/(-) = b(-)'T7 with rj ~ Nr{0, W), where now 



b(s) := a{s) (pp(s, ki), . . . , pp(s, k^))', seV, 



(7) 



W := ((pp(kj, kj))jj=i^...^r) ^ Thus, we have z/(-)|0, /C ~ GP(0, Cj,), where C,y(si, S2) 

b(Si)'Wb(s2),Si,S2GP. ' 

In what follows, we do not choose a fixed set of knots /C in (|6]). Instead we model /C as a 



random point process. As discussed later at the end of Section 3.2, it is not necessary to strongly 
penalize large numbers of basis functions, r, through the prior on /C. Thus, we assume a flat, 
noninformative, improper prior for }C with density proportional to 1. 



2.3 Adding a Tapered Remainder Component 



It was pointed out by Finley et al.'('2009) that the predictive process can only account for smooth 
dependence. Hence, as in Sang and Huang] ( [2012( ), we write: 



(8) 
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Then S{- 



is independent of z/(-), and ^(•) ~ GP(0, Ci), where C^(-, ■) = Cp(-, •) 



OA 



Huang 



is a va lid covariance function. To achieve computational feasibility for large n. Sang and 
(|2012|) proposed to replace in ^ by ~ GP(0, where 



C^5(Sl,S2) =r(||Si- 8211/^)^5(81,82), 81,82 G I? (9) 

is a tapered version of C-^. In (|9]), T(-) is a compactly supported correlation function (see, e.g.. 



Gneiting 2002 ) that is equal to zero when its argument is greater than one. Multiplication of 
with T achieves that (7^(81, 82) = Oif ||8i — 82II > L, resulting in a covariance matrix that is sparse 



and quickly invertible (see Section 3.4 below). We will assume the tapering length L to be fixed 
and chosen to ensure computational feasibility. 

In summary, our data model is given by ([2]), and our process model is given by 



(10) 



where z/(-) describes the medium-range to long-range spatial dependence, and (5(-) accounts for 
local (or short-range) dependence. Both z/(-) and 5(-) are zero-mean Gaussian processes, whose 
covariance functions depend on a random set of knots, /C, with a flat prior distribution, and on a 



parent covariance function, Cp, parameterized by and specified further in Section 2.4 below 



2.4 The Parent Covariance Function 

Let M ^j denote the Matern correlation function ( |Stein[ [T999| p. 50), 

M^{h) = {2hV^yic^{2hV^)2^'yr{v), h>o, 



(11) 

and AiviO) = 1, where /C^, is the modified Bessel function of the second kind of order v > 0. 
Also, let 

g(8i, 82) = {2(81 - 82)'(Sa(8i) + S^(82))-^(8i - s^)}'/^ 81, 82 G , G N, (12) 

be a spatially varying (SV) Mahalanobis-like distance, where Sa(s) is a d x d positive-definite 
matrix describing (local) geometric anisotropy at location s. We write, 5]yi(8) := IZ^s) T(s) 7?.(8)', 
where r(8) := diag{'yi{s), . . . , 7^(8)}, {7^ : V — )■ M"*", j = 1, . . . ,d} are SV scale parameters, 
and 7?. is a rotation matrix parameterized by SV rotation angles {kj : P — )■ [0, 7r/2], j = 1, . . . , d — 



1}. A valid nonstationary Matem correlation function ( Paciorek and Schervish| 2006; Stein 2005 1 
is given by. 



A^(8i,82) = c(8i,82)A^ 



ivisi)+v{s2))/ 



2(9(81,82)), 81,82 G G N, 



(13) 



where c(8i, 82) := |S^(8i)|V4|s^(82)| V4|(Sa(8i) + Sa(82))/2|-V2. 
Choosing pp := Ai in ^ results in the parent covariance 

Cp(8i, 82) = a(8i)a(82) M{si, 82), 81, 82 G I) C G {1, 2, 3}. 



(14) 



This nonstationary Matern class is very flexible, in that it allows for SV standard deviation cr(-), 
SV smoothness parameter v{-), and SV geometric anisotropy through SV scale parameters {7j(-) : 
j = 1, . . . , and SV rotation angles {Kj(-) : j = 1, . . . , ci — 1}. 
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To ensure computational feasibility, we let the parameters vary spatially according linear com- 
binations of spatial basis functions. We assume that all SV parameters are determined by the 
(random) parameter vector, := (a, rj'^, v, t}'^, 7', 77^, k', rj'^)', through models of the form, 

e{s)=ge{e + h0{syr,0), seV, (15) 

where 9{-) is a generic notation for one of the SV parameters, 9 ~ N(iig, aj), rje ~ Nrg{0, TglrJ, 
and hg{-) is an r^-dimensional vector affixed basis functions (same for all parameters), each nor- 
malized to [0, 1]. The functions ge{-) are transformations from M to the range of ^(■). 



Table 1 : Details for the S V covariance parameters of the form ([15]) 



Parameter 


Symbol ^(O 


Range of ^(^ 


Transformation g0{-) 


Ate 




Standard deviation 


a{.) 


M+ 


exp(-) 


(*) 


al = 0.25 


Smoothness 


v{-) 


[0,2] 




Atw = 


^^ = 1 


Scale 




M+ 


exp(-) 


(*) 


a2 = 0.25 


Rotation angle 




[0,7r/2] 


(vr/2)$(-) 


At. = 


^1 = 1 



$(■): Cumulative distribution function of the standard normal distribution. 

(*): The prior means At^ and At7 depend on the application; see Section|4]for specific choices. 

Specific choices for gei-), jj^e, and are given in Table [T| For example, we have a{s) = 
exp(cr -I- he{s)'r]f^), a ~ N{^„,al. = 0.25), and r]„ ~ Nrg{0.,Tj\rg). Note that we restrict the 
smoothness parameter v{-) to the interval [0, 2], as "the data can rarely inform about smoothness of 
higher orders" (Banerjee et al. 2008 1. The parameter determines how much 0{-) is allowed to 
vary a priori over V; we set rj = (0.25)^ for all SV parameters (see'Katzfuss and Cressie, 2012), 
inducing shrinkage towards stationarity for the covariance function Cp. 

For hg(-) in ( [T5| ), any choice of basis functions is possible. Assuming that the covariance 
parameters vary smoothly over space, we choose a relatively small number of power exponential 
correlation functions, hQ{s) = (exp{ — ((s— Ci)/A)^}, . . . ,exp{ — ((s— Crg)/A)^})', with (relatively 
large) fixed scale parameter A, and fixed centers Ci, . . . , c^g . Specific choices depend on the domain 
V and are given in Section |4} 

Our choice for T in (|9]) in this article is Kanter's function (|Kanter[|T997]): 



r(x) 



sin(27ra;) , 1— cos(27ra;) 



2nx 



XG (0,1), 



(16) 



T(x) := for X > 1, and we set T(0) := 1. The function T(||h||) is positive-definite for h G M^, 
it is twice differentiable at the origin, and it minimizes the curvature at within the class of all 
compactly supported and valid (in M'^) correlation functions (Gneiting 2002). 



Li summary, for fixed f3, /C, and 6, the covariance function of the true process F(-) in ( [T0| ) is 

Cy(si, S2) = a(si, S2) + r(||si - S2||/L){Cp(si, S2) - a(si, S2)}, Si, S2 G V, (17) 

where and T are given by Q and ([16]), respectively. It follows immediately from Proposition 1 
in [Sang and Huang (2012) that this covariance function is positive definite. It is a close approxima- 
tion to Cp{-, ■) for a large, dense set of knots, /C (or for large L). Here, because /C is random, ( [TT] ) 
is more flexible than the parent covariance and hence preferable in many nonstationary real- world 
situations. Note that, because cr(-) is infinitely differentiable, T(-) is twice differentiable at the 
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origin, and A1(s, s + h) is also at most twice differentiable for v{s) < 2 (see also 



Paciorek and 



Schervish 2006 1, the smoothness of Y{-) at location s G P is solely determined by v{s) (for fixed 
f3, /C, and 0). 



3 Posterior Inference 



3.1 Summary of the Model in Vector Notation 

Integrating out rj and 6{-), the data, Z := (Z(si), . . . , Z(s„))', are distributed as, Z|r2 ~ A^„(X/3, Sz), 
where := {/3, 0, /C}, and the i-th row of X is given by x(s,;)'. The data covariance matrix is. 



Sz := var{Z\n) = BWB' + V, 



(18) 



where the z-th row of the n x r matrix B is given by b(s,;)' (see Q), W is defined below (|7]), 
V := + Ve, := a^In, and := {Cs{si, Sj))ij=i_...^„ is the sparse n x n covariance matrix 
of the vector S := (5(si), . . . , 5(s„))' (see 

In what is to follow, [A] will denote the distribution or density of a generic random variable A, 
and [A \ . . .] will denote the full conditional distribution of A (i.e., the distribution of A given the 
data and all parameters other than A in Q). Further, let Nk{sL\fj,, S) denote the probability density 
function of a A;-variate normal distribution with mean and covariance matrix S, evaluated at a. 

The densities of the full conditional distributions of the elements of f2 are all proportional to, 

[z,n] = [z\n][n] = iv„(z|x/3,Sz)[/3][0][/c], 



where [/C] oc 1, [/3] oc 1, and [6] is described below ( [T5] ). 



3.2 The Reversible-Jump MCMC Algorithm 

For posterior inference, we will employ a reversible jump Markov chain Monte Carlo (MCMC) 



algorithm (jGreen|[l995j) based on a Gibbs sampler ( |Geman and Geman||1984[ ) with some adaptive 
MetropoUs-Hastings steps ( [Metropolis et al.[ |1953[ [Hastings! [1970[ jHaario et aL| |2001| ). We will 
emphasize dependence of on a set of parameters by placing the parameters in parentheses. 
The MCMC sampler consists of the following steps: 

1. Sample /3 from, [/3| . . . ] = A/p(/3|(X'S^^X)-iX'S^^Z, (X'S-^X)-^). 

2. Sample 6 using a MetropoUs-Hastings step from, [6\ . . .] oc [6] (Z |X/3, Sz (^)) . 

3. Sample a new set of knots from [/C| . . . ], as follows. At each MCMC iteration, we propose 
one of three modifications to the current set of knots, each with probability 1/3: 

(a) Add a knot: Draw a new knot, k^+i, from a uniform distribution on V, and let /C* : = 
/C U {kf+i} be the proposed set of knots, which now has size r* = r + 1. 

(b) Deleteaknot: Select one knot uniformly at random from /C; that is, draw J ~ f/(l,2,... 
Then set /C* := /C\{kj} and r* = r - 1. 



r). 
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(c) Moving a knot (a combination of (a) and (b)): First select a knot uniformly at random 
to be deleted, and then select a location uniformly on V at which to add a new one (i.e., 
where to move the old knot). This results in /C* := {k^+i} U /C\{kj} and r* = r. 



The reversible-jump acceptance probability (Green 1995 1 for the proposed /C* can be shown 
to be equal to min{l, a], where 

iV„(Z|X/3,S^(/C*))Q(/C*,/C) 



a :-- 



iV„(Z|X/3,Sz(/C)) Q(/C,/C*)' 



(19) 



and the proposal ratio is given by, 

Q^(/C*,/C) 



l/(r + l), r* = r + l 
r, r* = r — 1 

1, r* = r. 



(20) 



Q^(/C,/C*) 

Note that for r = 0, deleting or moving a basis function is impossible, and so in this case we 



always propose to add a basis function. As a result, the proposal ratio in ( |20l ) is given by 1/3 
when r = 0. 

There might be a concern that, for very large datasets, the data always favor a very large number 
of basis functions, unless there is strong penalization for large r through the prior distribution on /C. 
However, note that the acceptance probability ([19]) for a proposed set of knots, /C*, is the product 
of the Bayes factor (of /C* versus /C) and a term depending only on the proposal distribution chosen 
for /C* (cf. Holmes and Mallick |2000[ , App. I). This is reassuring, as "the Bayes factor functions as 
a fully automatic Occam's razor" ( |Kass and Raftery[[1995 t p. 790), and so there is strong intuition 
that our flat prior, [/C] oc 1, is sufficient and that no explicit penalty for large r is necessary. 



3.3 Spatial Prediction 



In spatial statistics, the main interest is typically in making inference on the true process Y 
a set of prediction locations, {sf , . . . , s^^}, which might or might not include the set of obs( 
locations. We write, = + B^rj + S^, and so we need samples from. 



(21) 



where samples of the first term on the right-hand side were obtained in Section 3.2 Because it 
can be very computationally expensive, we only obtain samples of rj and for thinned MCMC 
iterations after convergence of the MCMC for (see van Dyk and Park 2008 for why this is 
valid). We have 



r]\n, Nr ((B'V-^B + W-^)-^B'V-^(Z - X/3), (B'V-^B + W"^)"^) 



and 



s^\'n,n,z 



where 



V 



[Z -X/3-Br7),V 



(22) 



np 

:= cof (5^, (5), and 5 := (5(si), . . . , (5(s„))'. After appropriate 
[5', 6^']' , where 6^ denotes 5{-) evaluated at all unobserved prediction 

rP,O^T-l^TP,0, 



var{6^) 
reordering, we write 

locations. To avoid having to obtain V^'^^V^^V^ '^' explicitly, we obtain a sample from ([22]) by 



calculating the quantity, 6^ + Vf'*^V"^( 



5 

X/3 



B77 



where 6^ := {6', 6 



Ur\r 



iVnp(0, V|) and e ~ A^„(0, V,) (cf. conditional simulation, jCressieJ [1993| Sect. 3.6.2). 
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3.4 Computational Issues 



Note that in ( fTS] ) is a dense n x n matrix of full rank n, and so naive calculation of its inverse, 
which appears in the MCMC updates, is computationally infeasible for large n. However, we 
can employ the Sherman-Morrison - Woodbury formula ( Sherman and Morrison[ 1950; Woodbuiyl 
m\ [Henderson and Searle| |198 1| ) to obtain, S^^ = V'^ - V -^B(W-^ + B^V-^B)-^B^V-\ 
and a similar formula gives, = |V||Ir + WB'V^^B| (e.g.,|Cressie and Johannesson 2008). 
Since the tapering range, L in (|9]), is fixed, the position of the nonzero elements of V is the same 
for all MCMC iterations. Hence, we order the locations to allow for efficient Cholesky decompo- 
sition of V (e.g., using the minimum-degree ordering) only once, at the beginning of the MCMC 
algorithm. 

In general, the number of computations required for operations involving a sparse matrix de- 
pends on the number and locations of the nonzero elements (Gilbert et al.[ 1992[ ). Some numerical 
results in |Furrer et al.| ( |2006| ) indicate that the time required to compute the Cholesky decomposi- 
tion of a tapered nxn covariance matrix increases roughly linearly with n (for fixed domain, fixed 
tapering length, and a regular sampling grid), which in turn indicates that the computational com- 
plexity of our algorithm is approximately of order n. Questions about theoretical computational 
complexity aside, in our experience the majority of computation time at each of the MCMC itera- 
tions was actually spent on evaluating the modified Bessel function in ( [TT] ) for each of the nonzero 
elements of the matrix V5 (and of Vf and V^'*^ for iterations in which spatial predictions are 
obtained). We have considerable control over the speed of the MCMC algorithm through selection 
of the tapering range, L in ([9]). For extremely massive datasets, we can set L to a very small value 
to achieve computational feasibiUty. 



4 Numerical Model Comparisons 



In this section, we will compare our model to the model of Sang and Huang ( 2012[ ), which rep- 
resents the current state-of-the-art in terms of geostatistical approaches to the analysis of large 
spatial datasets. |Sang and Huang ( 2012| ) showed that their model can result in better predictions 
and model fit than the predictive-process approach of Banerjee et al. (2008 ). Our approach can be 
viewed as an extension of the [Sang and Huan^ ( 2012[ ) model in terms of two components: random 
knots and the use of the nonstationary Matem covariance of Section 2.4 as the parent covariance 
function. Therefore, our comparisons will examine the effects of two factors: random versus (a 
varying number of) fixed knots, and a nonstationary Matern parent covariance (NPC) versus a 
stationary one (SPC). (SPC is a special case of our model obtained by setting r| = in ([15]).) 



4.1 Simulation Studies in One Spatial Dimension 

For the following three simulation studies, the true process is assumed to exist on a one-dimensional 
domain, V = [1, 512], with potential measurement locations at {1,2,.. .,512}. 

In Simulation Study 1, we assumed that the true process Y(-) is a deterministic function: 



Y{s) = 1 + sin (27T (^)') sin (2O7T (^)' 
This true process is shown in Figure [T] 



s eV. 



(23) 
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50 100 150 200 250 300 350 400 450 500 



Figure 1: For Simulation Study 1, the true process F(-) and one sample of the data Z, together 
with the posterior mean and a point- wise posterior 90% credible interval (CI) of F(-), the posterior 
mean (PM) of /i(-) + z/(-) (i.e., without 5{-)), and the density of the knots (histogram at the bottom) 
using our model 



Based on this true process, we created 100 datasets of observations by adding independent 
normal measurement error with variance a"^ = cxy ■ 5% = 0.004, where ay = 0.08 is the empirical 
variance of {Y{1), . . . , F(512)}. To examine the medium-to-long-range prediction performance 
of the models, we created four test intervals, in which no data was observed (collectively referred to 
as missing by design, or MBD). These test intervals each have length 25 and begin at locations 70, 
198, 326, 454, respectively. In addition, one third of the remaining locations (henceforth referred 
to as missing at random, or MAR) were selected at random at each iteration of the simulation study 
as unobserved test locations (to test short-range prediction performance near observed locations). 
The remaining 275 observed locations will be denoted OBS. 

To ensure comparability of the results, we assumed the measurement-error variance to be 
known for all models. For each of the 100 simulated datasets, each of the models was run for 
10,000 MCMC iterations (thinned by a factor of 10), the first 5,000 of which were taken as burn- 
in (based on examination of trace plots in a pilot study). The tapering length in (|9]) was chosen 
as L = 6.5, resulting in about 2,400 nonzero elements (less than 9 per row) for V in ( [T8| ). The 
prior distributions of the parameters of the parent covariance were as described in Section [24} with 
/icr = log(o'y) and /i^ = log(3000). The spatial trend, yu(-) in Q, consisted only of an intercept 
(i.e., x(-) = 1). 

For the random knots, the proposal distribution for new knots (as described in step 3 of Section 



3.2) was a uniform distribution on [—9,522]. As a pilot study using our model showed that the 
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posterior mean of r was around 11, we used two sets of fixed knots: The first consisted of eight 
evenly spaced knots between locations -10 and 522, and the second set consisted of 14 evenly 
spaced knots between -4 and 516. For the models with nonstationary parent co variance, we took 
he(-) in ([15]) to be made up of four power exponential functions with scale parameter A = 74, 
centered at locations 64, 192, 320, 448, respectively. One set of observations, Z, is shown in 
Figure [T| together with a summary of the corresponding results using our model. Very few knots 
are selected between locations 380 and 500, because the process fluctuates so quickly in that area 
that it can basically be picked up entirely by the tapered remainder component, 5{-). 

To measure prediction accuracy of the models under consideration, we used the mean squared 
prediction error, the squared difference between the true process Y(-) and the posterior mean for 
each of the models. To quantify the accuracy of the uncertainty estimation, we also calculated 
the interval score, which combines the width of a credible interval (here, 95% posterior credible 
intervals) with a penalty for not containing the true value (see [Gneiting and Raftery[ [2007 Sect. 
6.2). The goal is for a small interval score. Both mean squared prediction error and interval score 
were averaged over the 100 simulated datasets and all 512 locations (ALL), and also averaged 
within each of the groups of locations described earlier (OBS, MAR, MBD). 



Table 2: Results of Simulation Study 1 



Parent covariance 


Random knots 
NPC SPC 


8 Fixed knots 
NPC SPC 


14 Fixed knots 
NPC SPC 


Full model 
NPC SPC 


Time (min) 


3.64 3.79 


2.01 2.01 


2.76 2.73 


101.17 100.05 


MSPE (ALL) X 100 
MSPE (OBS) xlOO 
MSPE (MAR) xlOO 
MSPE (MBD) XlOO 


1.00 1.02 
0.11 0.11 
0.24 0.25 
4.48 4.58 


1.19 1.25 
0.18 0.21 
0.51 0.57 
4.87 5.05 


1.38 1.56 
0.13 0.19 
0.36 0.48 
6.24 6.80 


1.48 1.57 
0.14 0.18 
0.23 0.28 
6.88 7.17 


IS (ALL) X 100 
IS (OBS) XlOO 
IS (MAR) X 100 
IS (MBD) X 100 


26.71 28.45 
15.54 15.77 
21.64 21.93 
64.40 72.26 


33.16 47.28 
20.11 21.77 
33.01 38.56 
69.23 129.36 


45.65 72.84 
17.11 20.99 
27.16 36.52 
149.43 265.21 


62.94 80.60 
18.30 22.42 
23.92 29.77 
239.18 310.26 


Posterior mean of r 


10.59 11.24 


(8) (8) 


(14) (14) 


(275) (275 ) 



NPC = nonstationary parent covariance; SPC = stationary parent covariance; MSPE = mean 
square prediction error; IS = interval score; Time = average time for each MCMC (averaged over 
the 100 simulated datasets); ALL = all 512 locations; OBS = the 275 observed locations; MAR = 
the 137 missing-at-random locations; MBD = the 100 missing-by-design locations 



The results for Simulation Study 1 are shown in Table |2] Two trends are evident in terms 
of both scores: Using a NPC produced better predictions than using a SPC, and random knots 
resulted in better predictions when compared to fixed knots. The more fixed knots were used (we 
even included the full model, for which r = n), the closer the resulting models were to their 
respective parent processes (which are clearly the wrong models for F(-) in ([23])), and the worse 
the scores were for the test regions (MBD). 

Throughout this article, we assumed that most real-world processes do not have covariances 
of simple parametric form. To examine the performance of our model in the unlikely event of 
encountering a process that does exhibit simple parametric covariance, we conducted two more 
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simulation studies. In Simulation Studies 2 and 3, we sampled a new true process F(-) 100 times 
each as a constant spatial "trend" equal to 1 plus a mean-zero Gaussian process component with 
the Matern covariance function of ([14]). For Simulation Study 2, we chose 

a{s) = 3exp (sin((l - |s/256 - 1|) 27r)/2) 

= 600 exp ( - 2 sin(s 27r/256)) (s/256) (24) 

v{s) =3$(-sin(s27r/256)), 

and for Simulation Study 3, we used a stationary Matem covariance with a{s) = 3, 7(5) = 600, 
and v{s) = 1. At each of the 100 iterations, we then simulated data, Z, by adding a spatially 
independent measurement-error term with variance cr^ = 3^ ■ 5% = 0.45 at each observed location. 
The remaining setup was exactly the same as in Simulation Study 1, except that we chose = 
log(3) and /i^ = log(600). 



Table 3: Results of Simulation Study 2 



Parent covariance 


Random knots 
NPC SPC 


8 Fixed knots 
NPC SPC 


14 Fixed knots 
NPC SPC 


Time (sec) 


184.12 


192.42 


111.22 


110.38 


152.10 


150.19 


MSPE (ALL) 


1.80 


1.89 


2.02 


2.14 


1.95 


2.07 


MSPE (OBS) 


0.23 


0.28 


0.26 


0.35 


0.24 


0.34 


MSPE (MAR) 


1.66 


1.84 


1.69 


1.73 


1.61 


1.70 


MSPE (MBD) 


6.30 


6.38 


7.28 


7.61 


7.12 


7.31 


IS (ALL) 


4.76 


5.83 


4.97 


7.28 


4.82 


7.46 


IS (OBS) 


2.28 


2.69 


2.40 


2.93 


2.30 


2.89 


IS (MAR) 


5.65 


7.74 


5.45 


8.44 


5.08 


8.43 


IS (MBD) 


10.37 


11.87 


11.39 


17.68 


11.38 


18.67 


Posterior mean of r 


8.82 


9.78 


(8) 


(8) 


(14) 


(14) 



NPC = nonstationary parent covariance; SPC = stationary parent covariance; MSPE = mean 
square prediction error; IS = interval score; Time = average time for each MCMC (averaged over 
the 100 simulated datasets); ALL = all 512 locations; OBS = the 275 observed locations; MAR = 
the 137 missing-at-random locations; MBD = the 100 missing-by-design locations 



In Simulation Study 2 (see Table[3]), the NPC models worked better than the corresponding SPC 
models (as expected, because the true F(-) was nonstationary). Overall, random knots resulted in 
better predictions than fixed knots, especially for the (mis specified) SPC models. 

For Simulation Study 3 (see Table |4]), NPC still worked slightly better than SPC, suggesting 
that there is no penalty in terms of predictive distributions for using the (more flexible) NPC model 
when the true process is stationary. The models with random knots again had the best results. 

For all three simulation studies, the models with random knots resulted in longer computation 
times than the models with eight or 14 fixed knots. 



4.2 Analysis of Soil Readings from a Gamma-Radiometer 



We also compared the models of Section 4.1 using a large real- world spatial dataset. Viscarra 



Rossel et al. (2007) collected high-resolution soil information on Nowley farm in New South 
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Table 4: Results of Simulation Study 3 



r dlClll CUVoIlallCC 


Random knots 
NFC SPC 


8 Fixed knots 
NPC SPC 


14 Fixed knots 
NPC SPC 


lime i^sec ^ 


193 68 


196 75 


108 87 


109 51 


149.12 


148 85 


i\/rcT3TH /' A 1 1 ^ 


1 1 1 


1 1 9 


1 . JU 


1 S8 
1 .Jo 


1 

1 .L'-r 


1 .Z J 


IVlOr \^W-DvJ ) 


0.20 


0.20 


0.24 


0.24 


0.22 


0.22 


iVlor C (^iVl/\lVj 


0.40 


0.40 


0.55 


0.57 


0.47 


0.48 


MSPE (MBD) 


4.58 


4.62 


6.57 


6.67 


5.08 


5.10 


IS (ALL) 


4.12 


4.20 


5.07 


5.16 


4.68 


4.71 


IS (OBS) 


2.13 


2.14 


2.30 


2.31 


2.25 


2.26 


IS (MAR) 


3.12 


3.14 


3.62 


3.67 


3.38 


3.38 


IS (MBD) 


10.93 


11.33 


14.69 


15.04 


13.13 


13.27 


Posterior mean of r 


10.40 


10.69 


(8) 


(8) 


(14) 


(14) 



NPC = nonstationary parent covariance; SPC = stationary parent covariance; MSPE = mean 
square prediction error; IS = interval score; Time = average time for each MCMC (averaged over 
the 100 simulated datasets); ALL = all 512 locations; OBS = the 275 observed locations; MAR = 
the 137 missing-at-random locations; MBD = the 100 missing-by-design locations 



Wales, Australia. It is important to develop automated soil sensing for monitoring and precision 
agriculture, because conventional soil sampling is far too costly to be routinely used on a large 
scale. 

Specifically, I Viscarra Rossel et al.|(|2007]) obtained 34,266 gamma-ray readings using a gamma- 



radiometer mounted on the front of a four-wheel-drive vehicle. After some preprocessing, they 
smoothed the data using "local kriging" and carried out a multivariate calibration of the hyperspec- 
tral gamma-ray data to predict soil properties. They showed that "kriging improved the signal-to- 
noise ratio of the gamma-ray spectra." We focus here on spatial prediction of the total radioactivity 
count, the integrated count over the 0.4 - 2.81 mega-electronvolt spectrum, given in units of counts 
per second. The total count has been shown to be closely associated with the clay content in the 
soil ( [Taylor et alj |2002| [Praciho et alj |2004[ ). Previously, [Cressie and Kang| p010| ) carried out an 
exploratory data analysis of total count and obtained spatial predictions using a spatial-random- 
effects model. 

To assess prediction performance, we created a test region (called MBD) containing 409 obser- 
vations. The test data were only used for model evaluation, and they were not available for model 
fitting. The remaining n = 33, 866 measurements, together with the test region (MBD), are shown 
in the top left panel of Figure |2] The spatial domain was taken to be V := (223525, 225770) x 
(6526400, 6527930) in Easting and Northing. 



Following [Cressie and Kang| ( |2010| ), we log-transform the (shifted) data to obtain additive 
measurement error: 

Z{s,) := log(TC(s,) + 160), z = 1, . . . , n, (25) 

where TC denotes the total radioactivity count. 

Cressie and Kang ( 2010| ) identified Easting and Northing as important trend terms, and so 
we set x(s) := (l,s')', where each location s E V is a two-dimensional vector consisting of 
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X 10 X 10 




2.24 2,245 2.25 2.255 2.24 2,245 2,25 2.255 

X 10* X 10* 



Figure 2: Top left: Gamma emissions total count observations (small colored dots) and locations 
of the 25 basis-function centers for hg(-) (black circles). Top right: Posterior mean of the true 
intensity using our model. Bottom left: Posterior standard deviation of the true intensity. Bottom 
right: Posterior mean of the smooth process without S{-) (see text). The test region (MBD) is 
represented by a pink rectangle. Color-scale units are counts per second; Easting and Northing are 
given in meters. 



Easting and Northing (in meters). The measurement-error variance (on the log scale) is known 
from another experiment to have a value of 
empirical variance of Z was calculated to be (t| 



0.0016 (see Cressie and Kang, 2010). As the 



0.0026 (after subtracting the trend as estimated 
by ordinary least squares), the signal-to-noise ratio is less than 2. Nonetheless, it is possible to 
distinguish signal from noise in many areas of the domain due to high sampling density (see top 
left panel of Figure |2]). 

We considered two equidistant grids of fixed knots on the domain V, one with 64 and one with 
144 locations. For the vector he{-) in ( [15] ), we chose 25 power exponential functions with scale 
parameter 300 and centers shown in the top left panel of Figure |2j We chose a tapering length 
of L = 35 in (|9]), which resulted in roughly 150 nonzero elements per row for V in ( [T8| ) (i.e., 
about 99.56% of the elements of V were zero). The prior distributions of the parent-covariance 
parameters were as described in Section 2.4, with /i^r = log((Ty) and yU^ 

cry : = 



log (5 77. 76), where 



On an Intel Xeon X5560 machine with 94.5 GB RAM, we ran an MCMC for each of the 
models for 20,000 iterations, of which 10,000 were considered burn-in (based on examination 
of trace plots), and we only used every 10th of the remaining iterations for inference. We also 
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obtained the posterior distribution of F(-) at a grid of 5,707 locations. In Figure |2| using our 
model, we show the posterior means (top right panel) and standard deviations (bottom left panel) 
of the (error-free) true intensity (TI) on the original scale, defined in analogy to the transformation 
( |25| ) as TI(s) := exp{F(s)} — 160. We also show the posterior mean of exp{/i(-) + — 160 
(i.e., without 5{-)) in the bottom right panel of Figure |2] 

The model comparison was carried out on the log-scale. We obtained samples from the poste- 
rior distribution of Z{sj) at test location s^- as, Z^''^{sj) := Y'^''\sj) + e^''\sj), where the Y^''\sj) 
are posterior samples from Y(sj), and e^'^\sj) ~ A^(0, cr^) is independent "measurement error." 
We then calculated the average squared distance (ASD) of the means of {Z(^)(sj)} to the test 
observations Z{sj), and the interval score for 95% credible intervals for Z{sj), for all models, 
averaged over the test locations. 



Table 5: Summary of the results of the soil data analysis 





Random knots 


64 Fixed knots 


144 Fixed knots 


Parent covariance 


NPC 


SPC 


NPC 


SPC 


NPC 


SPC 


Time (hours) 


89.87 


95.05 


59.02 


59.15 


158.84 


152.36 


ASD (MBD) xlOO 


0.26 


0.28 


0.27 


0.28 


0.32 


0.30 


IS (MBD) X 100 


26.96 


28.13 


27.39 


28.48 


29.41 


31.38 


Posterior mean of r 


35.57 


42.88 


(64) 


(64) 


(144) 


(144) 



NPC = nonstationary parent covariance; SPC = stationary parent covariance; ASD = average 
squared distance; IS = interval score; Time = total time for the MCMC; MBD = 

missing-by-design (test region) 



The results are shown in Table |5j Random knots resulted in lower average squared distance 
and interval score than fixed knots. With the exception of average squared distance for the models 
with 144 fixed knots, NPC also improved over SPC. More knots resulted in less accurate predictive 
distributions. 

Based on examination of trace plots, mixing was somewhat slower for random knots than for 
fixed knots and slower for NPC than for SPC (both for the soil data here and for the simulated 
data in Section [4~T] ). Because the same number of MCMC iterations was used for all models, the 
computation times given in Tables |2]-j5]can be slightly misleading. However, since the focus in this 
article is not parameter estimation but on prediction, and predictive performance of the models was 
also assessed based on an equal number of MCMC iterations, we feel that the comparison is fair. 



5 Conclusions 



In this article, our starting point was the|Sang and Huang ( 2012| ) approach to analyzing large spa 



tial datasets, which combines a low-rank predictive-process component with a tapered remainder 
component. To achieve enough flexibility for the nonstationary processes often encountered in 
real- world applications, we extended this model in two ways: First, the components in the model 
are parameterized based on a nonstationary Matem parent covariance function, in which the pa- 
rameters vary spatially according to linear combinations of spatial basis functions. Second, for the 
low-rank component, which can be written as a linear combination of spatial basis functions, we 
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make inference on the number, locations, and shapes of the basis functions. Posterior inference via 
reversible jump MCMC and related issues are described in detail. 

The results of a simulation study (Section 4. 1 1 and an analysis of a very large soil dataset 
(Section 4.2 1 indicate that the two extensions described above can result in improved predictive 
distributions, especially in terms of quantifying prediction uncertainty. We show that for (typically 
nonstationary) real-world processes, it should often not be the goal to approximate a simple co- 
variance model (e.g., the stationary Matern covariance) as closely as possible. Results indicate that 
our model is sufficiently flexible to overcome a misspecified parent covariance, and its flexibility 
does not seem to result in a penalty in the unlikely event that the truth is, in fact, a simple stationary 
covariance (see Simulation Study 3 in Section 4.1 1. Due to its adaptability, our model can be used 
to model highly nonstationary processes with varying levels of smoothness. 
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